Stack-Driven Masking for Grammar-Constrained Decoding (GLR + Weighted Automata)

This post is intentionally a “foot-in-the-door” writeup: enough detail to communicate the core idea (and timestamp it), not a full implementation guide. I’ll likely follow up with the compile-time tricks later.

Executive summary

Grammar-constrained generation has two very different problems hiding under one name:

commit(token): update the constraint state after you sampled a token. This is incremental parsing.
get_mask(state): given the current constraint state, return the set of next LLM tokens that keep you valid. This is the hard part on the critical path.

For commit, incremental parsing techniques (LR/GLR + a graph-structured stack) are a good fit.

For get_mask, the usual “simulate candidate tokens through the parser” approach ties runtime to vocabulary size and tokenization ambiguity in ways that are brutal for p99.9 latency.

The idea I want to put on the record:

Invert the problem. Precompile the grammar+tokenizer into a bitset-weighted automaton whose input is the parser stack (or GSS). Then get_mask becomes a single pass that reads stack state IDs and does bitset ∩/∪ operations—no per-vocabulary-token simulation.

Setup

You have a grammar (often “JSON Schema”, though in practice this means the structural subset you can compile into something CFG-like), and you want the LLM to only emit valid outputs.

At each decoding step the model produces logits over the vocabulary $V$ . You sample a token. You append it to the output. Repeat.

Constrained decoding is: don’t sample tokens that would make the output invalid.

So we maintain a constraint state and, for each step, compute a mask $M \subseteq V$ of allowed tokens.

No more broken JSON—at least syntactically.

Two primitive operations

It’s useful to separate constrained decoding into two methods:

get_mask(state) -> bitset[V]
Returns which LLM tokens are valid next from this state.
commit(state, token)
Update the constraint state after that token has been chosen.

At runtime you want a tight loop like:

loop:
  parallel:
    GPU: compute logits
    CPU: compute mask
  apply mask to logits
  sample token
  commit token to:
    - model KV-cache
    - constraint state

This split matters because:

commit is “do work proportional to what we actually emitted” (one token).
get_mask is “do work every step no matter what” (critical path).

Why worst-case latency matters (p99.9, not p50)

Inference servers batch many sequences on the GPU. The CPU-side mask has to arrive “in time” for each decode step, across the whole batch.

If the GPU step latency is ~1ms in your throughput regime, then you need your mask computation to reliably finish within that same window for every active sequence. A single slow mask can stall the whole batch, which cascades into ugly scheduling behavior.

So it’s not “can you do masking fast on average?” It’s “can you do it fast every single step?”

That’s the lens for everything below.

`commit`: incremental parsing is a good mental model

When you call commit(state, token), you are incrementally parsing a growing prefix. This is a well-studied problem.

Glossary (to keep levels straight)

There are three different “alphabets” floating around:

LLM token: an element of the model vocabulary $V$ (e.g. 200k entries). Each has a byte string.
Grammar terminal: the symbols your grammar/parser consumes (could be bytes/chars, or lexer tokens like --, Identifier, etc.).
Parser stack state IDs: LR-style automaton states (integers) stored on the parse stack.

Also:

LR: table-driven parsing; the parser’s control state is an integer “state ID” and it maintains a stack of those IDs.
GLR: handles ambiguity by representing multiple possible stacks at once via a graph-structured stack (GSS).

Why GLR/GSS shows up

If your constraints have nondeterminism (common once you compile real schema-ish structures, alternatives, optionality, “oneOf”-style branching, etc.), you often end up with “multiple parses remain viable so far.”

The GLR family’s core trick is: don’t backtrack, represent the ambiguity compactly. You keep a set of stack “heads” in a shared graph.

This tends to work well for commit because:

Updates are local and table-driven.
The GSS can share most structure between alternatives.
It’s designed to be updated incrementally.

So far so good.

`get_mask`: this is where things get ugly

Now you have a parse state (often a GSS), and you want a mask over LLM tokens.

A straightforward approach is:

For each LLM token $t$ , check whether appending it keeps the parse valid.

But you don’t actually append “an LLM token” to the parser. You append:

the token’s byte string,
which is tokenized/lexed into grammar terminals (sometimes with ambiguity),
which drive shifts/reductions in the parser (sometimes many reductions between terminal consumptions),
across potentially many GSS heads.

Doing that per vocabulary token is a non-starter.

The tokenization ambiguity trap (why the Fibonacci story matters)

Here’s the precise point I want to make with the “112 dashes” anecdote:

Even if you don’t like the specific tokenizer example, incremental lexing can be ambiguous, and ambiguity can explode.

In a toy lexer where the only terminals are - and --, a run of $N$ dashes has $F_{N+1}$ segmentations (Fibonacci growth). For $N=112$ , that’s on the order of $10^{23}$ ways to segment.

Why does this matter to masking?

Because the naive masking algorithm effectively asks:

“Is there any way to segment this LLM token’s bytes into terminals such that the parser can consume them from the current state?”

If you handle “token → terminal sequence(s)” by exploring a tokenization trellis, then in the worst case you’re exploring an exponential number of segmentations per candidate LLM token.

Most practical systems avoid the full explosion with tries/trellises and memoization. That helps. But it doesn’t change the fundamental shape of the problem: you’re still trying to answer a question about all 200k tokens by simulating “what if we took this token?”

Why tries/trellises still hurt in p99.9 land

A common direction is:

Build a trie/trellis representing how LLM tokens map to terminal sequences.
Traverse that trie while advancing the parser state, pruning invalid paths.
Collect the set of LLM tokens that survive.

This can work well for many grammars and many workloads. My claim is narrower:

For general CFG-ish constraints + large vocabularies + strict worst-case deadlines, this approach has pathological cases that are hard to engineer away.

The big practical pain points are:

Reductions are “invisible work”: a single terminal shift can trigger many reductions. In GLR, those reductions can fan out.
Pointer-heavy operations: GSS traversal/manipulation is branchy and cache-unfriendly compared to tight bitset ops.
Worst-case variability: some steps are cheap; some steps hit deep reduction chains and/or many active heads; those are the steps that break batching.

That’s what I mean by “dead end” in the original draft: not “nobody can make it fast,” but “I couldn’t get a per-token / per-trie-path simulation approach to have predictable p99.9 behavior at scale without the complexity ballooning.”

Invert the problem: read the stack, don’t simulate the vocabulary

The key reframing is:

Token validity depends on the parse stack(s).
So instead of asking “does token $t$ work on this stack?”, ask “given this stack, which tokens work?”

That sounds tautological, but it changes the engineering shape:

“simulate token” naturally loops over $|V|$ or over a large fraction of the token trie.
“read the stack” naturally loops over stack depth / GSS structure, which is usually far smaller and is the actual information that determines validity.

Now we need a data structure that turns “stack → allowed tokens” into something we can execute fast.

That’s where the weighted automaton comes in.

The answer (the idea): bitset-weighted automata over parser stacks

What I mean by “weighted automaton” here

Think of a finite automaton, except each transition carries a weight and traversing paths combines weights.

In my setting:

The weights are bitsets over the LLM vocabulary (allowed-token sets).
Traversing a transition applies an intersection (filter):
“tokens that survive this constraint”.
When multiple paths reach the same automaton state, we union their weights:
“tokens that work via any viable path”.

This is exactly the “semiring” you’d expect on sets:

“multiply” along a path: $\cap$
“add” at merges: $\cup$

The missing link (made explicit): the stack is the input string

The automaton’s input alphabet is the set of LR parser state IDs.

The automaton reads (a representation of) your current parse configuration:

For a single LR stack: the sequence of state IDs from top to bottom.
For GLR: a GSS, i.e. a DAG representing many possible stacks sharing suffixes.

So the automaton is not replacing the parser. It’s a precompiled masking machine:

Input: current stack / GSS (state IDs)
Output: bitset mask over LLM tokens valid next

Intuition: each token corresponds to “a regular language of stacks”

Another way to see it (which helped me):

Fix an LLM token $t$ .
Consider the set of parser stacks from which $t$ can be consumed (taking reductions into account, and respecting tokenization/lexing rules).

That set of stacks is something you can represent with a finite automaton over stack state IDs (a “stack language” of sorts).

If you did that for every $t \in V$ , you’d have 200k automata. That’s useless at runtime.

A weighted automaton is how you smash those 200k membership tests into one run:

Transitions are annotated with “which tokens follow this transition.”
Running it once on the stack gives you “which tokens’ automata accept this stack.”

Runtime sketch

For a single LR stack, the runtime looks like:

def get_mask(stack_state_ids_top_to_bottom):
    # frontier: map automaton_state -> bitset(tokens)
    frontier = { A.start: ALL_TOKENS }

    for sid in stack_state_ids_top_to_bottom:
        new = {}
        for a_state, tokens in frontier.items():
            for (a2, weight) in A.step(a_state, sid):
                tokens2 = tokens & weight            # intersection (filter)
                if tokens2.any():
                    new[a2] = new.get(a2, EMPTY) | tokens2  # union (merge paths)
        frontier = new

        if decided(frontier):
            break

    return combine_accepting(frontier)

For a GSS, you do the same computation but over a graph rather than a single list:

Branching in the GSS corresponds to alternative stacks → union at merges.
Shared suffixes mean you don’t re-walk shared stack tails per alternative.

(Details omitted here, but the point is: it’s the same ∩/∪ propagation pattern.)

Why this is the right kind of work for p99.9

Mask computation becomes:

A scan over the relevant portion of the stack/GSS
With inner loops that are mostly:
- transition-table lookups
- bitset intersections
- unions on merge

Critically:

You’re not iterating over the vocabulary.
You’re not traversing a token trie driven by bytes.
You’re not simulating reductions per candidate token path.

The runtime is driven by what the parse state actually contains.

And you can early-exit when further stack depth cannot change the result (the decided(frontier) hook above), which is common if the automaton reaches a fixed point.

How this fits with GLR (commit + mask together)

Putting it together:

commit(token) uses GLR machinery:
- update the GSS for the actual emitted token
- keep ambiguity compact
get_mask(state) uses the weighted automaton:
- treat the current GSS as the “input”
- propagate token bitsets through the automaton with ∩/∪
- return a vocabulary mask

So you get a split where:

the parser is responsible for correctness of incremental acceptance (commit)
the automaton is responsible for cheaply answering “what next?” (get_mask)

Why I think this is close to the right shape (without overselling “optimal”)

I’m avoiding “this is optimal” as a theorem, but the shape matches what I want in production:

No $|V|$ loop on the critical path.
Work proportional to stack/GSS structure, which is the actual dependency.
Inner loop dominated by bitset ops, which are predictable and SIMD-friendly.
Early termination when the mask stops changing.

You still have worst cases (deep stacks exist), but it’s a worst case that comes from the actual output structure (nesting depth), not from adversarial vocabulary/tokenization effects.

Tradeoffs / scope notes (what this post is not claiming)

“JSON Schema” caveat: full JSON Schema includes semantic constraints (ranges, regexes, uniqueItems, etc.). This post is about the grammar-shaped core. You can layer semantic predicates on top, but that’s a separate axis.
Fast runtime, slow compile: precompiling into a weighted automaton involves determinization/minimization-like steps (in a semiring setting) and can blow up if you’re careless. Most of my engineering time went into making compile-time and memory behave on large grammars. That’s a follow-up.
Memory is real: a 200k-vocab bitset is ~25KB. You cannot naively hang a fresh 25KB bitset off every transition. You need sharing/interning/chunking/sparsity tricks. Again: follow-up.
This doesn’t magically remove ambiguity: GLR ambiguity still exists; the goal is to make mask computation robust in the presence of it.

Where this sits relative to existing constrained decoding libraries

Most constrained decoding systems I’ve seen in the LLM ecosystem do some variant of:

represent valid next characters/bytes/terminals from the current parse state, and
intersect that with a token trie to decide which LLM tokens are compatible.

That can be great for many use cases (especially regular/near-regular constraints like JSON).

The approach I’m describing is different in the “mask” half:

It tries to precompile “stack → token mask” directly,
so get_mask is stack-driven rather than vocabulary-/trie-driven.

I’m not claiming other approaches are “wrong.” I’m saying: if you care about strict worst-case deadlines in a batched inference setting, you want a get_mask whose cost is not dominated by $|V|$ or tokenization pathologies.

Closing

Constrained generation is often sold as “just mask invalid tokens.” That hides the real split:

Incremental parsing (commit) is relatively well served by GLR/GSS techniques.
Mask generation (get_mask) is the true bottleneck if you care about p99.9 latency at scale.

My contribution/claim here is the reframing and the data structure:

Precompile the grammar+tokenizer into a bitset-weighted automaton over parser stack state IDs, and compute masks by reading the stack/GSS once with ∩/∪ bitset propagation.

That’s the idea I wanted to put on the record. If there’s interest, the next post would be about the “slow compile” side: how to build/determinize/minimize these automata without exponential blowups, and how to represent the weights without storing a 25KB bitset per edge.